sinks: walk the input arrangement via cursor per batch #36165
DAlperin merged 5 commits into MaterializeInc:main
Conversation
Force-pushed 264cfea to 36243b0
Extracts the batch-level cursor walking logic from combine_at_timestamp into a standalone helper that yields DiffPairs one at a time via callback. This lets sinks consume an arrangement directly inside their own operator without materializing a Vec<DiffPair> per (key, timestamp) group. combine_at_timestamp is left in place; subsequent commits will migrate the Kafka and Iceberg sinks to consume arrangements via the new helper.
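The shape of that helper can be sketched without differential-dataflow. This is a minimal stand-in, not the Materialize implementation: `DiffPair`, the flat `(key, time, value, diff)` batch layout, and the helper's signature are all simplified assumptions; the real helper walks arrangement cursors.

```rust
/// Simplified stand-in for Materialize's DiffPair: the value retracted
/// ("before") and the value inserted ("after") for one key at one time.
#[derive(Debug, PartialEq)]
struct DiffPair<V> {
    before: Option<V>,
    after: Option<V>,
}

/// Hypothetical helper modeled on the PR's for_each_diff_pair: walk updates
/// sorted by (key, time) and hand one DiffPair at a time to `emit`, instead
/// of materializing a Vec<DiffPair> per (key, time) group.
/// Updates are (key, time, value, diff) with diff in {-1, +1}.
fn for_each_diff_pair<K: PartialEq, V: Clone>(
    updates: &[(K, u64, V, i64)],
    mut emit: impl FnMut(&K, u64, DiffPair<V>),
) {
    let mut i = 0;
    while i < updates.len() {
        let key = &updates[i].0;
        let time = updates[i].1;
        // Fold this (key, time) group's retraction/insertion into one pair.
        let mut before = None;
        let mut after = None;
        while i < updates.len() && updates[i].0 == *key && updates[i].1 == time {
            let (_, _, v, diff) = &updates[i];
            if *diff < 0 {
                before = Some(v.clone());
            } else {
                after = Some(v.clone());
            }
            i += 1;
        }
        emit(key, time, DiffPair { before, after });
    }
}
```

The callback style is the point: the caller (a sink operator) sees each pair as the cursor advances, so nothing per-group is ever allocated.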
Force-pushed c94fb74 to bc5c160
def- left a comment
When a Kafka or Iceberg sink is configured with a non-unique key (KEY (...) NOT ENFORCED), Materialize is supposed to log a warning ("primary key error") whenever multiple rows share the same key at the same timestamp. With this PR, the warning doesn't appear if no new data arrives afterwards. Not sure how bad that is.
Reproducer:
diff --git a/test/source-sink-errors/mzcompose.py b/test/source-sink-errors/mzcompose.py
index ac2b7c3e92..5a9bbeb7ee 100644
--- a/test/source-sink-errors/mzcompose.py
+++ b/test/source-sink-errors/mzcompose.py
@@ -13,6 +13,7 @@ Disruption and then checking the mz_internal.mz_*_statuses tables
"""
import random
+import time
from collections.abc import Callable
from dataclasses import dataclass
from textwrap import dedent
@@ -561,3 +562,52 @@ def unsupported_pg_table(c: Composition) -> None:
$ postgres-execute connection=postgres://postgres:postgres@postgres
INSERT INTO source1 VALUES (3, '[2:3]={2,2}')
"""))
+
+
+def workflow_test_pk_violation_warning(c: Composition) -> None:
+ seed = random.randint(0, 256**4)
+ c.up("redpanda", "materialized", Service("testdrive", idle=True))
+
+ with c.override(
+ Testdrive(
+ no_reset=True,
+ seed=seed,
+ entrypoint_extra=["--initial-backoff=1s", "--backoff-factor=0"],
+ )
+ ):
+ c.testdrive(dedent("""
+ > CREATE CONNECTION kafka_conn
+ TO KAFKA (BROKER '${testdrive.kafka-addr}', SECURITY PROTOCOL PLAINTEXT);
+
+ > CREATE CONNECTION IF NOT EXISTS csr_conn TO CONFLUENT SCHEMA REGISTRY (
+ URL '${testdrive.schema-registry-url}'
+ );
+
+ > CREATE TABLE t (a INT, b INT);
+
+ > CREATE SINK pk_test_sink FROM t
+ INTO KAFKA CONNECTION kafka_conn (TOPIC 'testdrive-pk-test-${testdrive.seed}')
+ KEY (a) NOT ENFORCED
+ FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY CONNECTION csr_conn
+ ENVELOPE UPSERT;
+
+ > SELECT status FROM mz_internal.mz_sink_statuses WHERE name = 'pk_test_sink'
+ running
+ """))
+
+ # PkViolationWarner rate-limits to once per 10s from construction.
+ time.sleep(12)
+
+ c.testdrive(dedent("""
+ > INSERT INTO t VALUES (1, 10), (1, 20);
+
+ $ set-sql-timeout duration=60s
+ $ kafka-verify-topic sink=materialize.public.pk_test_sink await-value-schema=true
+ """))
+
+ time.sleep(5)
+
+ logs = c.invoke("logs", "materialized", capture=True)
+ assert "primary key error" in logs.stdout, (
+ "PkViolationWarner::flush() not called after batch"
+ )

Replaces the pre-grouped Vec Collection<(Option<Row>, DiffPair<Row>)> that render_sink previously handed sinks with an Arranged<SinkTrace> — each sink now walks cursors itself via for_each_diff_pair.
- The shared key-extraction + arrangement logic lives in render_sink in sinks.rs; zip_into_diff_pairs and combine_at_timestamp are gone.
- Kafka's encode_collection walks the arrangement inside its async operator, emitting a KafkaMessage per DiffPair.
- Iceberg gains a small walker operator at its input that walks the arrangement into the same (Option<Row>, DiffPair<Row>) stream shape its mint / write / commit pipeline already consumed.
- Per-(key, timestamp) primary-key-violation detection is lifted into a new PkViolationWarner in sinks.rs, observed incrementally as each sink walks the arrangement instead of after a Vec<DiffPair> materialization.

The immediate win is skipping the Vec<DiffPair> allocation per (key, time) in combine_at_timestamp; the larger architectural win is that sinks can now make envelope- and batch-specific decisions about cursor navigation (e.g., snapshot fast paths) without touching shared plumbing.
SinkRender::render_sink now takes a Stream<Vec<SinkBatch>> instead of an Arranged<SinkTrace>. Sinks only ever walk incoming batches (via for_each_diff_pair) — they never use the TraceAgent for random access — so the reader handle is dead weight. arrange_sink_input still calls arrange_named, but immediately extracts arranged.stream and lets the surrounding Arranged (and its TraceAgent) drop. With no reader holding compaction frontiers, the arrange operator's spine can compact to the empty antichain as batches flow, releasing historical batch state instead of accumulating it. Mirrors the pattern used by DD's consolidate_named, which builds an arrangement only to call as_collection on it and drop the trace.
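The "drop the reader so the spine can compact" mechanic can be illustrated with plain reference counting. This is an analogy, not the differential-dataflow API: `ReaderHandle` stands in for the `TraceAgent`, and the `Weak` models "compaction may discard what no live reader can still demand".

```rust
use std::rc::{Rc, Weak};

/// Toy reader handle: while it lives, it keeps the historical batches alive,
/// the way a TraceAgent's compaction frontier keeps spine history resident.
struct ReaderHandle(Rc<Vec<&'static str>>);

/// Returns (history alive while reader held, history alive after reader drop).
fn demo() -> (bool, bool) {
    let batches = Rc::new(vec!["batch-0", "batch-1"]);
    // The arrangement side holds only a Weak: it never pins history itself.
    let history: Weak<Vec<&'static str>> = Rc::downgrade(&batches);
    let reader = ReaderHandle(batches); // reader now owns the only strong ref
    let while_held = history.upgrade().is_some();
    drop(reader); // mirrors letting the Arranged (and its TraceAgent) drop
    let after_drop = history.upgrade().is_some();
    (while_held, after_drop)
}
```

In the PR the same effect comes from extracting `arranged.stream` and letting the surrounding `Arranged` go out of scope: with no handle outstanding, the arrange operator's spine can compact toward the empty antichain as batches flow.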
IcebergSink measures steady-state streaming (1 seed row in before() then INSERT of 1M rows inside benchmark()), which exercises incremental emission but not the snapshot path where the sink reads a pre-existing source from as_of. IcebergSnapshot mirrors ExactlyOnce's shape: init() pre-populates a table with 1M rows, before() generates a unique Iceberg destination per run, and benchmark() creates a fresh sink and measures until the full snapshot is committed. This is the path where the zip_into_diff_pairs allocation hotspot lived, and therefore the relevant micro-benchmark for the sink refactor.
Force-pushed bc5c160 to 46f9ca4
martykulma left a comment
looks great @DAlperin - minor comments, nothing blocking.
    // just needed a value to use to distribute work.
    if user_key_indices.is_none() {
        collection = collection.map(|(_key, value)| (None, value))
    let arranged = keyed.arrange_named::<OrdValBatcher<_, _, _, _>, RcOrdValBuilder<_, _, _, _>, OrdValSpine<_, _, _, _>>("Arrange Sink");
nit: more explicit RE: dropping the trace
let Arranged{ stream, trace: _ } = ...;
stream
curious if the trace would be optimized out by the compiler in this current version, or would this nit possibly provide some perf benefit?
agreed with the nit in terms of readability, though
You should make sure to drop the trace, otherwise you risk never releasing arrangement memory. A handle to a trace needs to be downgraded to release resources.
I updated the code to make the drop clearer, but this is what's happening, no? The trace goes out of scope and is dropped. TraceAgent::drop then calls downgrade before the Rc falls.
(My comment was less about the implementation and rather what should be true :) I've had some surprises in the past when things were accidentally moved into closures and whatnot.)
    if self.count > 1 {
        let now = Instant::now();
        if now.duration_since(self.last_warning) >= Duration::from_secs(10) {
            self.last_warning = now;
if the snapshot is quick (<10s), this won't emit a warning. It looks like the previous one didn't either, so possibly not a big deal.
Yeah, stole it from the old implementation, not that worried about it for now
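The reviewer's "quick snapshot emits no warning" observation is easy to reproduce in isolation. A sketch of the rate limiter under discussion, with field and method names guessed from the review snippet (the real PkViolationWarner's API may differ):

```rust
use std::time::{Duration, Instant};

/// Sketch of the warner's rate limiting: at most one warning per `period`,
/// measured from construction / the last emission. Violations observed
/// entirely within the first window stay silent, which is the reviewer's
/// point about snapshots that finish in under 10 seconds.
struct PkViolationWarner {
    count: usize,
    last_warning: Instant,
    period: Duration,
}

impl PkViolationWarner {
    fn new(period: Duration) -> Self {
        PkViolationWarner { count: 0, last_warning: Instant::now(), period }
    }

    /// Record that `n` rows shared one key at one timestamp; returns whether
    /// a warning would be logged now.
    fn observe(&mut self, n: usize) -> bool {
        self.count = n;
        if self.count > 1 {
            let now = Instant::now();
            if now.duration_since(self.last_warning) >= self.period {
                self.last_warning = now;
                return true; // the real code would log "primary key error"
            }
        }
        false
    }
}
```

With `period = 10s`, the very first violation after construction is suppressed; a zero period makes every violation warn, which also shows the suppression is purely the window, not the detection.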
    let fan_out = |(t, v, cnt): (B::Time, B::ValOwn, usize)| {
        if cnt == 1 {
            Either::Left(iter::once((t, v)))
        } else {
            Either::Right(iter::repeat((t, v)).take(cnt))
        }
should do the same
std::iter::repeat_n((t,v), cnt)
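The suggestion collapses both branches into one call. A self-contained sketch (the generic signature here is illustrative; the real closure is typed against the batch's `B::Time` and `B::ValOwn`):

```rust
use std::iter;

/// std::iter::repeat_n (stable since Rust 1.82) covers both the cnt == 1 and
/// cnt > 1 cases, so the Either::Left / Either::Right branching in fan_out
/// becomes unnecessary.
fn fan_out<T: Clone, V: Clone>((t, v, cnt): (T, V, usize)) -> impl Iterator<Item = (T, V)> {
    iter::repeat_n((t, v), cnt)
}
```

As a bonus, `repeat_n` yields its value by move on the final iteration, so it clones one fewer time than `repeat(..).take(cnt)`.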
Force-pushed a0c2078 to bda54df
    let keyed = match key_indices {
        None => collection.map(|row| (Some(Row::pack(Some(Datum::UInt64(row.hashed())))), row)),
        Some(key_indices) => {
            let mut datum_vec = mz_repr::DatumVec::new();
this is obviously unrelated to your changes, but I'm curious: why is the key a whole Row of its own that needs to be packed? Do the sinks represent the key as a sort of struct column that needs to be built from the Datums at each of the key indices? It feels like it would make more sense to merge all the key Datums into one combined key Datum rather than its own Row. I suppose we can't really precompute a key column because it's entirely dependent on the sink (different sinks for the same persist collection could want different, arbitrary keys), but the TODO[perf] had me wondering.
Summary
Refactor the sink rendering path so sinks walk their input arrangement's cursors directly, eliminating the Vec<DiffPair<Row>>-per-group materialization and the trace-reader overhead that the zip_into_diff_pairs → combine_at_timestamp → flat_map pipeline previously incurred. Motivated by profiles of very large snapshot sinks where that pipeline dominated allocation and caused major page faults.

What this buys
- No per-(key, time) Vec<DiffPair<Row>> allocation. Biggest direct win — this was the dominant allocator in zip_into_diff_pairs on the profile.
- DiffPairs stream one at a time per (key, time) group.
- Fewer operator hops on the Kafka path (combine_at_timestamp → flat_map → encode). Iceberg: one fewer hop too.
- Dropping the TraceAgent lets spine compaction advance its frontier, releasing historical batch state rather than accumulating it.

The wins come from removing Vec/operator overhead, not from the spine itself shrinking.

Follow-ups (not in this PR) worth considering:
Test plan
- cargo check --workspace --all-targets
- cargo clippy -p mz-interchange -p mz-storage --tests
- bin/lint (check-no-diff fails locally due to jj colocation; all substantive checks pass)
- cargo test -p mz-interchange --lib envelopes:: (7 tests, all pass)

🤖 Generated with Claude Code